Part 3. Learning from random samples (part b)
Say our estimand \(\theta\) is a population mean, i.e. \({\textrm E}[X]\)
We have a plug-in estimator \(\hat{\theta}\), the sample mean \(\overline{X}_{(n)}\).
\(\overline{X}\) has a sampling distribution. What do we know about it?
What can we do with this knowledge?
Suppose we knew \({\textrm E}[X]\) and \({\textrm V}[X]\).
Then we know the asymptotic sampling distribution of \(\overline{X}\), i.e. \(N\left({\textrm E}[X], \frac{{\textrm V}[X]}{n}\right)\).
We could then compute, e.g. the probability of observing a sample mean below 2 if \({\textrm E}[X] = 2.104\), \({\textrm V}[X] = 4.63\), and \(n = 198\).
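This computation is a one-liner in R, plugging the values above into the asymptotic normal distribution:

```r
# P(sample mean < 2) under the asymptotic distribution N(E[X], V[X]/n),
# with E[X] = 2.104, V[X] = 4.63, and n = 198
pnorm(2, mean = 2.104, sd = sqrt(4.63 / 198))
# [1] 0.2482193
```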
The same approach works if we knew \({\textrm E}[X]\) but not \({\textrm V}[X]\): we can plug in an estimate of \({\textrm V}[X]\) from the sample.
But we don’t know \({\textrm E}[X]\) (that’s why we’re doing research).
We have an estimate \(\overline{X}\), and we can estimate \({\textrm V}[X]\) from the sample. What more can we say about \({\textrm E}[X]\)?
Reporting \(\hat{\sigma}[\overline{X}] = \sqrt{\frac{\hat{{\textrm V}}[X]}{n}}\), the standard error, is a start.
Approaches we will take:
Generally, these tools of statistical inference require
They do not require anything else about the underlying data (e.g. normality).
\(\implies\) “agnostic statistics”.
Could we specify a range that is likely (e.g. 95% likely) to include \(\theta\)?
That is the goal of constructing a confidence interval.
Lazy confidence intervals that are certain to include \(\theta\): e.g. \((-\infty, \infty)\), which always contains \(\theta\) but is uninformative.
We seek smaller ones that are likely to include \(\theta\).
An interval \(CI\) is a valid confidence interval for \(\theta\) with coverage \((1 - \alpha)\) if
\[\text{Pr}[\theta \in CI] \geq 1 - \alpha\]
Typical to choose \(\alpha = .05\), so the CI’s coverage is .95.
In the frequentist view, \(\theta\) is fixed and \(CI\) is the random variable; the probability statement is about repeated samples.
Suppose we know that \(\hat{\theta}\) is distributed normally with mean \(\theta\) and variance \(\sigma^2\) (i.e. \(N(\theta, \sigma^2)\)).
For now, suppose we know \(\theta\). (Remember, in real life we don’t.)
What is the shortest interval \([a,b]\) that will contain \(\hat{\theta}\) 95% of the time?
Because \(\hat{\theta}\) is normally distributed, the shortest interval \([a,b]\) that will contain \(\hat{\theta}\) 95% of the time is \[[\theta - 1.96 \sigma, \theta + 1.96 \sigma]\]
For 90% interval, \[[\theta - 1.64 \sigma, \theta + 1.64 \sigma]\]
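These multipliers come from the standard normal quantile function; in R:

```r
qnorm(0.975)  # ≈ 1.96: leaves 2.5% in each tail, for a 95% interval
qnorm(0.95)   # ≈ 1.64: leaves 5% in each tail, for a 90% interval
```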
Everyone: use R to draw a single value from a normal distribution with mean 4 and sd 2.
What proportion of draws are
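A sketch of this exercise, pooling everyone's draws via simulation (the specific cutoffs checked here are an assumption):

```r
set.seed(42)
draws <- rnorm(10000, mean = 4, sd = 2)  # many draws instead of one each

# proportion of draws within 1.96 sd of the mean (should be about .95)
mean(abs(draws - 4) <= 1.96 * 2)

# proportion of draws within 1.64 sd of the mean (should be about .90)
mean(abs(draws - 4) <= 1.64 * 2)
```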
We have an interval that contains 95% of \(\hat{\theta}\) draws, given \(\theta\) and \(\sigma\).
We want an interval that contains \(\theta\) 95% of the time, given \(\hat{\theta}\) and \(\hat{\sigma}\).
Consider this interval:
\[\left[ \hat{\theta} - 1.96 \hat{\sigma}, \hat{\theta} + 1.96 \hat{\sigma} \right] \]
We can construct it without knowing \(\theta\), and (asymptotically) it contains \(\theta\) 95% of the time!
Everyone: use R to draw a sample of size \(n=400\) from a normal distribution with mean 4 and sd 2. Using your sample, make a 90% confidence interval for the population mean:
Does your CI include \(4\) (the population mean)?
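One way to construct it (a sketch; the seed is an arbitrary assumption, so your numbers will differ):

```r
set.seed(1)
x  <- rnorm(400, mean = 4, sd = 2)            # the sample
se <- sd(x) / sqrt(400)                       # estimated standard error of the mean
ci <- mean(x) + c(-1, 1) * qnorm(0.95) * se   # 90% confidence interval
ci
```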
As a frequentist concept, the confidence interval is about a long-run average: if I make many 95% (valid) confidence intervals, 95% of them will contain the true value.
Recall, for a valid 95% CI \(\text{Pr}(\theta \in \text{CI}) \geq 95\%\).
\(\theta\) is not a random variable; this is a probability statement about the frequency of the CI including \(\theta\), not your beliefs about where \(\theta\) is.
But in the absence of other information, a Bayesian would say “There is a 95% probability that \(\theta\) is in this CI.”
With confidence intervals, we report an interval centered on estimate \(\hat{\theta}^*\) that is likely (in either frequentist or Bayesian sense) to contain the estimand \(\theta\).
Another approach: hypothesis testing.
Basic idea:
The more unlikely your result \(\hat{\theta}^*\) would be under the null (i.e. the lower the \(p\)-value), the more doubtful the null hypothesis appears.
Similar to proof by contradiction (modus tollens):
But it’s a probabilistic version (weak syllogism):
The latter conclusion is logically warranted (by Bayes’ rule) if \(\text{Pr}(\text{no call} \mid \text{love}) < \text{Pr}(\text{no call} \mid \text{no love})\).
But the conclusion that “he probably doesn’t love me” is not warranted on its own: it depends on how confident you were of his love before.
We have:
\[\begin{align} \text{Pr}[\text{love} \mid \text{no call}] &= \frac{\text{Pr}[\text{no call} \mid \text{love}] \text{Pr}[\text{love}]}{\text{Pr}[\text{no call}]}\\ \text{Pr}[\text{no love} \mid \text{no call}] &= \frac{\text{Pr}[\text{no call} \mid \text{no love}] \text{Pr}[\text{no love}]}{\text{Pr}[\text{no call}]} \end{align}\]
The ratio between them:
\[\overbrace{\frac{\text{Pr}[\text{love} \mid \text{no call}]}{\text{Pr}[\text{no love} \mid \text{no call}]}}^{\text{Posterior odds}} = \overbrace{\frac{ \text{Pr}[\text{no call} \mid \text{love}] }{\text{Pr}[\text{no call} \mid \text{no love}] }}^{\text{Likelihood ratio}} \overbrace{\frac{\text{Pr}[\text{love}]}{\text{Pr}[\text{no love}]}}^{\text{Prior odds}}\]
Posterior odds lower than prior odds \(\iff\) likelihood ratio \(< 1\).
Remember: for large enough \(n\), a plug-in estimator \(\hat{\theta}\) is normally distributed around estimand \(\theta\) if
But we don’t know \(\theta\).
But we can say, “Suppose \(\theta = \theta_0\)”.
In that case, we know that*
\[\hat{\theta} \sim N(\theta_0, \hat{\sigma}^2)\] and we can estimate the probability that \(\hat{\theta}\) would be in any interval, given \(\theta_0\).
*Asymptotically, and if \(\text{V}[\hat{\theta} \mid \theta = \theta_0] = \text{V}[\hat{\theta}]\)
Lower one-tailed p-value:
\[\text{Pr}[\hat{\theta} \leq \hat{\theta}^*] \, \, \text{assuming} \,\, \theta = \theta_0\] i.e. “under the null”.
Upper one-tailed p-value:
\[\text{Pr}[\hat{\theta} \geq \hat{\theta}^*] \, \, \text{assuming} \,\, \theta = \theta_0\] i.e. “under the null”.
Two-tailed p-value: \(\text{Pr}\left[\lvert\hat{\theta} - \theta_0\rvert \geq \lvert\hat{\theta}^* - \theta_0\rvert \right]\) assuming \(\theta = \theta_0\), i.e. “under the null”.
It is useful to transform our estimate \(\hat{\theta}^*\) into a \(t\)-statistic:
\[t = \frac{\hat{\theta}^* - \theta_0}{\sqrt{\hat{\text{V}}[\hat{\theta}]}} \]
In words: the difference between the estimate and the null, divided by the standard error of the estimator.
\(|t|\) gets bigger when
If the null is true (\(\theta = \theta_0\)), then asymptotically \(t \sim N(0, 1)\).
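For example, the two-tailed \(p\)-value follows from the standard normal reference distribution above (reusing the illustrative numbers from earlier: \(\hat{\theta}^* = 2\), \(\theta_0 = 2.104\), \(\hat{\text{V}}[\hat{\theta}] = 4.63/198\)):

```r
theta_hat <- 2       # observed estimate (illustrative)
theta_0   <- 2.104   # null hypothesis value
se_hat    <- sqrt(4.63 / 198)  # estimated standard error

t_stat <- (theta_hat - theta_0) / se_hat  # t-statistic, about -0.68
p_two  <- 2 * pnorm(-abs(t_stat))         # two-tailed p-value, about 0.496
p_two
```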
In principle, both are interesting. A one-tailed test is especially relevant if the null hypothesis is really that \(\theta \geq \theta_0\).
In practice, we only use one-tailed tests when the test is pre-registered. Otherwise two-tailed to be conservative, given the way \(p\)-values are used in testing.
Convention (credited to Fisher) is: “Reject null hypothesis if \(p < .05\).” When the null is true, rejection should occur 5% of the time.
Rejection of null hypothesis commonly interpreted (by seminar audiences, reviewers, editors, hiring committees) as “finding something”.
So researchers really want to reject null.
Many “best practices” are about minimizing cheating:
In this context, one-tailed p-values are seen as cheating.
Intuitively, a low \(p\)-value means “if the null hypothesis (that \(\theta = \theta_0\)) were true, we would infrequently encounter a result as extreme as the one that we saw. Therefore, if we reject the null hypothesis (that is, if we conclude that \(\theta \neq \theta_0\)) based solely on how extreme the result is, then that decision will be a mistake either infrequently (if \(\theta = \theta_0\)) or never (if \(\theta \neq \theta_0\)).” (Aronow & Miller, p. 128)
So how often will it be a mistake? i.e. what is \(\text{Pr}[\theta = \theta_0 \mid \text{reject}]\) (shifting to Bayesian perspective)?
Based on above, sounds like our rejections are incorrect “between infrequently (\(\alpha\)) and never”.
What is \(\text{Pr}[\theta = \theta_0 \mid \text{reject}]\) (probability a rejection is a mistake)?
Use Bayes’ Rule (problem set 2): \[\begin{align} \text{Pr}[\theta = \theta_0 \mid \text{reject}] &= \frac{\text{Pr}[\text{reject} \mid \theta = \theta_0 ] \text{Pr}[\theta = \theta_0]}{\text{Pr}[\text{reject} \mid \theta = \theta_0 ] \text{Pr}[\theta = \theta_0] + \text{Pr}[\text{reject} \mid \theta \neq \theta_0 ] \text{Pr}[\theta \neq \theta_0]} \\ &= \frac{\alpha p_0}{ \alpha p_0 + \text{Power} (1 - p_0)} \end{align}\] where \(\alpha = \text{Pr}[\text{reject} \mid \theta = \theta_0 ]\), \(p_0 = \text{Pr}[\theta = \theta_0]\) and \(\text{Power} = \text{Pr}[\text{reject} \mid \theta \neq \theta_0 ]\).
Suppose \(\alpha = .05\) (standard) and \(p_0 = .5\) (good chance \(\theta = \theta_0\)).
Then
Suppose 200 tests will be performed, and \(\text{Pr}[\theta = \theta_0] = .5\).
Good situation (power = .8):
Then only \(5/85 \approx .059\) of the rejections were mistakes.
Bad situation (power = .05):
Then \(5/10 = .5\) of the rejections were mistakes.
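Both situations are direct plug-ins to the Bayes’ Rule formula above:

```r
# Pr[rejection is a mistake] = alpha * p0 / (alpha * p0 + Power * (1 - p0))
p_mistake <- function(alpha, power, p0) {
  alpha * p0 / (alpha * p0 + power * (1 - p0))
}

p_mistake(alpha = .05, power = .80, p0 = .5)  # good situation: 5/85 ≈ .059
p_mistake(alpha = .05, power = .05, p0 = .5)  # bad situation: 5/10 = .5
```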
Intuitively, … “if we reject the null hypothesis…based solely on how extreme the result is, then that decision will be a mistake either infrequently (if \(\theta = \theta_0\)) or never (if \(\theta \neq \theta_0\)).” (Aronow & Miller, p. 128; emphasis added)
If “that decision” means “rejecting the null hypothesis”, then “infrequently” is wrong: \[\text{Pr}[\text{rejection is mistake} \mid \theta = \theta_0] = 1\]
If “that decision” means “rejecting only if the result is sufficiently extreme”, then “never” is wrong: \[\text{Pr}[\text{fail to reject} \mid \theta \neq \theta_0] < 1\]
Note that a high \(p\)-value does not offer the same guarantees for those looking to accept a null hypothesis and is accordingly limited in its utility for decision making. (Aronow & Miller, p. 128)
Is this true? What do we learn from a high \(p\)-value?
Again Bayes’ Rule says it depends on the likelihood ratio:
\[\begin{align} \frac{\text{Pr}[ \theta = \theta_0 \mid \text{high p-value}]}{\text{Pr}[ \theta = \theta_1 \mid \text{high p-value}]} &= \frac{\frac{\text{Pr}[ \text{high p-value} \mid \theta = \theta_0 ] \text{Pr}[ \theta = \theta_0 ]}{\text{Pr}[ \text{high p-value}]}}{\frac{\text{Pr}[ \text{high p-value} \mid \theta = \theta_1 ] \text{Pr}[ \theta = \theta_1 ]}{\text{Pr}[ \text{high p-value}]}} \\ &= \frac{\text{Pr}[ \text{high p-value} \mid \theta = \theta_0 ]}{\text{Pr}[ \text{high p-value} \mid \theta = \theta_1 ]} \frac{\text{Pr}[ \theta = \theta_0 ]}{\text{Pr}[ \theta = \theta_1 ]} \end{align}\]
If a high \(p\)-value is much more likely under the null than when \(\theta = \theta_1\), then \(\theta_0\) may become much more plausible compared to \(\theta_1\).
We know that plug-in estimators \(\hat{\theta}\) are normally distributed (under mild regularity conditions).
In many cases we can prove unbiasedness (\({\textrm E}[\hat{\theta}] = \theta\)) or consistency (\(\hat{\theta} \rightarrow \theta\)) (e.g. sample variance)
But what about the variance \({\textrm V}[\hat{\theta}]\)? (necessary for CIs, \(p\)-values)
We proved that \({\textrm V}[\overline{X}] = \frac{{\textrm V}[X]}{n}\) given iid samples, but what about
Example: correlation between two variables, ratio of two means
The bootstrap is a very general solution for estimating \({\textrm V}[\hat{\theta}]\).
\({\textrm V}[\hat{\theta}]\) describes the variance of \(\hat{\theta}\) across samples of size \(n\).
Problem: We have only one sample of size \(n\).
Bootstrap solution:
Original sample mean: 4.333333
Bootstrap resamples (drawn with replacement from the sample) and their means:
Resample 1: 2 5 2 2 6 3 (mean 3.333333)
Resample 2: 2 4 3 3 6 4 (mean 3.666667)
Resample 3: 5 6 6 5 6 5 (mean 5.5)
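The resampling above can be sketched as follows (the sample values here are illustrative assumptions):

```r
set.seed(7)
samp <- c(2, 3, 4, 5, 6, 6)   # hypothetical observed sample of size n = 6
mean(samp)                    # original sample mean

# draw many resamples of size n with replacement; compute each mean
resamples <- replicate(1000, mean(sample(samp, replace = TRUE)))
var(resamples)                # bootstrap estimate of V[theta-hat]
```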
The bootstrap is a plug-in estimator!
Compare to our approach to estimating \({\textrm V}[\overline{X}]\) previously:
Just use var() in the sample instead of the population.
We don’t need the bootstrap for \(V[\overline{X}]\), but it will work for (almost) anything.
Let’s get a 90% confidence interval for the mean of env (jobs/environment tradeoff) in 2012 CCES.
Earlier, we learned this approach to estimating \(\sigma[\overline{X}]\):
This also works:
The bootstrap approach:
Resample dat \(m\) times and compute the mean of each resample. I stored \(m=1000\) sample means in samp_means.
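A sketch of that bootstrap loop, using a stand-in vector env in place of the CCES column (the data values are assumptions):

```r
set.seed(2)
env <- rnorm(500, mean = 3, sd = 1)   # stand-in for the CCES env variable

# resample env m = 1000 times, storing each resample's mean
samp_means <- replicate(1000, mean(sample(env, replace = TRUE)))

sd(samp_means)                        # bootstrap estimate of sigma[X-bar]
sqrt(var(env) / length(env))          # the analytic estimate, for comparison
```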
Very close to other solution!
Our estimand \(\theta\): correlation between env and aa in 2012 CCES (CCES is population)
Our estimator \(\hat{\theta}\): correlation in a sample of \(n\) rows
Variance of our estimator \({\textrm V}[\hat{\theta}]\):
Take many resamples and compute cor(env, aa) in each one.
Naive bootstrap: Resample rows of the dataset with replacement (i.e. the method above).
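A sketch of the row-resampling (naive) bootstrap for the correlation, with stand-in data for the CCES variables:

```r
set.seed(3)
dat <- data.frame(env = rnorm(500), aa = rnorm(500))  # stand-ins for CCES env, aa

boot_cors <- replicate(1000, {
  idx <- sample(nrow(dat), replace = TRUE)  # resample rows with replacement
  cor(dat$env[idx], dat$aa[idx])            # correlation in the resample
})

var(boot_cors)   # bootstrap estimate of V[theta-hat]
```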
Block bootstrap: For grouped data (e.g. students in schools), resample groups rather than dataset rows
Bayesian bootstrap: Keep rows of dataset same but draw random reweightings
Residual bootstrap: Keep rows of dataset same but resample residuals, i.e.
Wild bootstrap (unrestricted): Keep rows of dataset same but rescale residuals, e.g. by draws from \(\{-1, 1\}\) (equal probability) or \(N(0, 1)\)
So far, we have focused on uncertainty that comes from sampling from a population: we don’t know about the units not in the sample.
In causal inference, we also care about uncertainty that comes from the assignment of treatment: we don’t know some of the potential outcomes, i.e. outcomes for a given unit under each possible treatment.
Sharp null hypothesis is that treatment does not affect outcomes.
Under sharp null, we do know the missing potential outcomes: they are the same as observed potential outcomes.
So:
Get a \(p\)-value by comparing observed treatment effect to distribution of treatment effects under sharp null.
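This randomization-inference procedure can be sketched with made-up data (the outcomes and treatment assignment below are assumptions):

```r
set.seed(5)
y <- rnorm(100)             # observed outcomes (made up)
z <- rep(c(0, 1), 50)       # treatment assignment (made up)

obs_effect <- mean(y[z == 1]) - mean(y[z == 0])  # observed difference in means

# under the sharp null, outcomes are fixed; re-randomize treatment many times
null_effects <- replicate(2000, {
  z_perm <- sample(z)       # permute treatment assignment
  mean(y[z_perm == 1]) - mean(y[z_perm == 0])
})

# two-tailed p-value: proportion of null effects as extreme as observed
mean(abs(null_effects) >= abs(obs_effect))
```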